Skip to content

Align UR_API fields (8 byte) for optimization create/move/copy structs on x64 cpus #17683

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

GermanAizek
Copy link
Contributor

PR migrated from oneapi-src/unified-runtime#2747

Would you like to organize migration to aligned structures your Unified Runtime API for modern x64 processors? This should be guaranteed to lead to more frequent structs entry into CPU cache, which can greatly affect performance if aligned structures are used frequently. I hope that you (as Intel employees) know advantage this optimization method at the architectural level codebase and its inconveniences as stylistic ABI breakdown.

Very briefly, your API is not badly broken, I have changed only first and second field in all aligned structures.
(pNext and sType)

More info about technique:
https://stackoverflow.com/a/20882083
https://zijishi.xyz/post/optimization-technique/learning-to-use-data-alignment/
https://en.wikipedia.org/wiki/Data_structure_alignment

Affected API structures:

  • ur_image_desc_t 80 -> 72 bytes
  • ur_exp_command_buffer_update_value_arg_desc_t 48 -> 40 bytes
  • ur_exp_command_buffer_update_memobj_arg_desc_t 40 -> 32 bytes
  • ur_exp_command_buffer_update_pointer_arg_desc_t 40 -> 32 bytes
  • ur_sampler_desc_t 32 -> 24 bytes
  • ur_program_properties_t 32 -> 24 bytes
  • ur_exp_sampler_addr_modes_t 32 -> 24 bytes
  • ur_platform_native_properties_t 24 -> 16 bytes
  • ur_device_native_properties_t 24 -> 16 bytes
  • ur_context_properties_t 24 -> 16 bytes
  • ur_context_native_properties_t 24 -> 16 bytes
  • ur_buffer_channel_properties_t 24 -> 16 bytes
  • ur_buffer_alloc_location_properties_t 24 -> 16 bytes
  • ur_mem_native_properties_t 24 -> 16 bytes
  • ur_sampler_native_properties_t 24 -> 16 bytes
  • ur_usm_host_desc_t 24 -> 16 bytes
  • ur_usm_device_desc_t 24 -> 16 bytes
  • ur_usm_alloc_location_desc_t 24 -> 16 bytes
  • ur_usm_pool_desc_t 24 -> 16 bytes
  • ur_physical_mem_properties_t 24 -> 16 bytes
  • ur_program_native_properties_t 24 -> 16 bytes
  • ur_kernel_arg_mem_obj_properties_t 24 -> 16 bytes
  • ur_kernel_native_properties_t 24 -> 16 bytes
  • ur_queue_properties_t 24 -> 16 bytes
  • ur_queue_index_properties_t 24 -> 16 bytes
  • ur_queue_native_properties_t 24 -> 16 bytes
  • ur_event_native_properties_t 24 -> 16 bytes
  • ur_exp_async_usm_alloc_properties_t 24 -> 16 bytes
  • ur_exp_file_descriptor_t 24 -> 16 bytes
  • ur_exp_sampler_cubemap_properties_t 24 -> 16 bytes
  • ur_exp_command_buffer_desc_t 24 -> 16 bytes
  • ur_exp_enqueue_ext_properties_t 24 -> 16 bytes
  • ur_exp_enqueue_native_command_properties_t 24 -> 16 bytes

…s on x64 cpus

Affected API structures:
- ur_image_desc_t 80 -> 72 bytes
- ur_exp_command_buffer_update_value_arg_desc_t 48 -> 40 bytes
- ur_exp_command_buffer_update_memobj_arg_desc_t 40 -> 32 bytes
- ur_exp_command_buffer_update_pointer_arg_desc_t 40 -> 32 bytes
- ur_sampler_desc_t 32 -> 24 bytes
- ur_program_properties_t 32 -> 24 bytes
- ur_exp_sampler_addr_modes_t 32 -> 24 bytes
- ur_platform_native_properties_t 24 -> 16 bytes
- ur_device_native_properties_t 24 -> 16 bytes
- ur_context_properties_t 24 -> 16 bytes
- ur_context_native_properties_t 24 -> 16 bytes
- ur_buffer_channel_properties_t 24 -> 16 bytes
- ur_buffer_alloc_location_properties_t 24 -> 16 bytes
- ur_mem_native_properties_t 24 -> 16 bytes
- ur_sampler_native_properties_t 24 -> 16 bytes
- ur_usm_host_desc_t 24 -> 16 bytes
- ur_usm_device_desc_t 24 -> 16 bytes
- ur_usm_alloc_location_desc_t 24 -> 16 bytes
- ur_usm_pool_desc_t 24 -> 16 bytes
- ur_physical_mem_properties_t 24 -> 16 bytes
- ur_program_native_properties_t 24 -> 16 bytes
- ur_kernel_arg_mem_obj_properties_t 24 -> 16 bytes
- ur_kernel_native_properties_t 24 -> 16 bytes
- ur_queue_properties_t 24 -> 16 bytes
- ur_queue_index_properties_t 24 -> 16 bytes
- ur_queue_native_properties_t 24 -> 16 bytes
- ur_event_native_properties_t 24 -> 16 bytes
- ur_exp_async_usm_alloc_properties_t 24 -> 16 bytes
- ur_exp_file_descriptor_t 24 -> 16 bytes
- ur_exp_sampler_cubemap_properties_t 24 -> 16 bytes
- ur_exp_command_buffer_desc_t 24 -> 16 bytes
- ur_exp_enqueue_ext_properties_t 24 -> 16 bytes
- ur_exp_enqueue_native_command_properties_t 24 -> 16 bytes

Signed-off-by: Herman Semenov <[email protected]>
@GermanAizek GermanAizek requested a review from a team as a code owner March 27, 2025 14:55
@aarongreig
Copy link
Contributor

Thanks for the contribution, we're discussing this internally.

@kbenzie
Copy link
Contributor

kbenzie commented Apr 21, 2025

While the contribution is appriciated we will not be accepting it at this time for the following reasons:

  • The order of data members is intentional, stype is a tag and pNext creates a linked list of structs used in the creating of UR handles
    • When processing a chain of pNexts the pointer to the struct can be cast to a pointer to the first element in the struct, the enum type, to determine how to access the data in a valid way, swaping stype and pNext would require reading sizeof(void*) bytes past the start of the struct instead
    • This pattern enables an extensible API without the need to break ABI
    • This is a pattern used in other API's such as Vulkan and Level Zero
  • The structs in question are only used during object creation which is not on the hot path
    • The structs are short lived objects more commonly stored on the stack before calling an object creation entry point after which they will go out of scope
    • The structs will generally not be stored in a large array negating the size benefits and cache behavior you may see when processing them in a hot loop
  • Claims about performance not backed up with benchmarks numbers are meaningless, evidence of performance improvements is essential and must include details about the platform used to run the benchmarks and steps to reproduce those results including the benchmarks used to create them

@kbenzie kbenzie closed this Apr 21, 2025
@GermanAizek
Copy link
Contributor Author

@kbenzie,

While the contribution is appriciated we will not be accepting it at this time for the following reasons:

* The order of data members is intentional, `stype` is a tag and `pNext` creates a linked list of structs used in the creating of UR handles

Alternatively, can use __attribute__((__packed__)) keyword for structures, this does not spoil the order of the structure fields.

Regarding testing, I would test it, but I do not know where intel/llvm have benchmarks and how to build them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants